In [ ]:
%load_ext lab_black

import warnings

warnings.filterwarnings("ignore")

Introduction¶

A decision tree is a simple, easy to interpret modeling technique for both regression and classification problems. Decision trees usually do not perform very well compared to other methods. Their relevance lies in the fact that they are the building blocks of two of the most successful ML algorithms as per today: random forests and gradient boosted trees. In this chapter, we will introduce these tree-based methods.

Decision Trees¶

How they work¶

In our journey to estimate the model $f$ by $\hat f$, we have considered mainly linear functions $f$ so far. We now move to a different function class: decision trees. They have been introduced in 1984 by Leo Breiman, Jerome Friedman and others [1] and are sometimes called "Classification and Regression Trees" (CART).

(Binary) decision trees are calculated recursively by partitioning the data in two pieces. Partitions are chosen to optimize the given average loss by asking the best "yes/no" question about the covariates, e.g., "is carat < 1?" or "is color better than F?".

For regression problems, the most frequently used loss function is the squared error. For classification, its "information" (= cross-entropy = log loss = half the unit logistic deviance) or the very similar Gini impurity.

Predictions are calculated by sending an observation through the tree, starting with the question at the "trunk" and ending in a "leaf". The prediction is the value associated with the leaf. For regression situations, such leaf value typically equals the average response of all observations in the leaf. In classification settings, it may be the most frequent class in the leaf or all class probabilities.

The concept of a decision tree is best understood with an example.

Example: decision tree¶

We will use the dataCar data set to predict the claim probability with a decision tree. As features, we use veh_value, veh_body, veh_age, gender, area and agecat.

In [ ]:
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Load data
car = pd.read_csv("car.csv")  # see readme how to get the data

# Preprocessing pipe. For simplicity, we treat 'veh_body' as ordinal
num_vars = ["veh_value", "veh_age", "agecat"]
ord_vars = ["gender", "area", "veh_body"]
all_vars = ord_vars + num_vars  # Use same order in preprocessor(!)
ord_levels = [sorted(car[v].unique()) for v in ord_vars]

preproc = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(categories=ord_levels), ord_vars),
        ("numeric", "passthrough", num_vars),
    ]
)

X = preproc.fit_transform(car)

# Feature matrix
pd.DataFrame(X, columns=all_vars).head()
Out[ ]:
gender area veh_body veh_value veh_age agecat
0 0.0 2.0 3.0 1.06 3.0 2.0
1 0.0 0.0 3.0 1.03 2.0 4.0
2 0.0 4.0 12.0 3.26 2.0 2.0
3 0.0 3.0 10.0 4.14 2.0 2.0
4 0.0 2.0 3.0 0.72 4.0 2.0
In [ ]:
# Fit model
model = DecisionTreeClassifier(criterion="entropy", max_depth=3)
model.fit(X, car["clm"])

print("First five predictions (probabilities for no claim/claim):")
print(model.predict_proba(X[0:5]))
First five predictions (probabilities for no claim/claim):
[[0.9333264  0.0666736 ]
 [0.9333264  0.0666736 ]
 [0.92529273 0.07470727]
 [0.92529273 0.07470727]
 [0.9333264  0.0666736 ]]
In [ ]:
# Visualize and "read" tree
import numpy as np
import matplotlib.pyplot as plt

print("Fitted decision tree: if condition is 'yes', go to left!")
plt.figure(figsize=(16, 10))
tree.plot_tree(
    model, feature_names=all_vars, proportion=True, filled=True, impurity=False
)
plt.show()

print("\nFirst observation (original and encoded)")
(
    car[all_vars]
    .head(1)
    .append(pd.Series(X[0], index=all_vars), ignore_index=True)
    .set_index(np.array(["Original", "Encoded"]))
)
Fitted decision tree: if condition is 'yes', go to left!
First observation (original and encoded)
Out[ ]:
gender area veh_body veh_value veh_age agecat
Original F C HBACK 1.06 3.0 2.0
Encoded 0.0 2.0 3.0 1.06 3.0 2.0

Comments

  • The first observation belongs to a person in age category 2 and has a $10'600 hatchback: the first, second, and third question all send us to the left ("yes"). This gives us a claim probability of 6.7%.
  • How was, e.g., the first question (agecat <= 4.5) chosen? The algorithm scans all covariates for all possible split positions and picks the one with best average loss improvement. In this case, splitting on covariate agecat at the value 4.5 reduced the average loss most.

Properties of decision trees

  • Outliers: In contrast to linear models, outliers in covariates are not an issue because the algorithm only takes into account the sort order of the values. Similarly, taking logarithms in covariates has no effect. Both statements do not hold for the response variable.
  • Missing values: Some implementations can deal with missing values in the input. Alternatively, missing values are often replaced by a typical value or a value smaller/larger than the smallest non-missing value (such as -1 for a positive variable).
  • Categorical covariates: Unordered categorical covariates are tricky to split because with $\ell$ levels, theoretically, we end up with $2^\ell$ possible partitions. Try to lump small categories together or consider representing the levels by ordered categories (even if it does not make too much sense). One-hot-encoding is an option as well. Some algorithms offer ways to internally deal with unordered categoricals.
  • Greedy: Partitions are made in a greedy way to optimize the objective in one step. Looking ahead more than one step would lead to better models but this is computationally too demanding in practice.
  • Interactions: By their flexible structure, a decision tree can automatically capture interaction effects of any order (and other non-linear effects), at least if the data set is large and the tree is deep enough. This is a big advantage of tree-based methods compared to linear models where these elements have to carefully and manually be accounted for. In the next chapter, we will meet another model class with this advantage: neural nets.
  • Extrapolation: By construction, a decision tree cannot extrapolate. So even a whopping 10 carat diamond cannot get a higher price prediction than the most expensive diamond in the training data.

These properties typically translate 1:1 to combinations of trees like random forests or boosted trees.

Random Forests¶

How they work¶

In 2001, Leo Breiman introduced a very powerful tree-based algorithm called random forest, see [2]. A random forest consists of many decision trees. To ensure that the trees differ, two sources or randomness are injected:

  1. Each split scans only a random selection "mtry" of the $m$ covariates to find the best split, usually about $\sqrt{m}$ or $m/3$. "mtry" is the main tuning parameter of a random forest.
  2. Each tree is calculated on a bootstrap sample of the data, i.e., on $n$ observations selected with replacement from the original $n$ rows. This technique is called "bagging", from "bootstrap aggregating".

Predictions are found by pooling the predictions of all trees, e.g., by averaging or majority voting.

Comments about random forests

  • Number of trees: Usually, 100-1000 trees are being grown. The more, the better. More trees also mean longer training time and larger models.
  • Diversification: Single trees in a random forest are usually very deep. They overfit on the training data. It is the diversity across trees that produces a good and stable model, just with a well-diversified stock portfolio.
  • Never trust performance on the training set for random forests.
  • OOB validation: In each tree, about 1/3 of all observations are not in the bootstrap sample just by chance. Put differently: each observation is used in about 2/3 of all trees. If its prediction is calculated from the other 1/3 of the trees, we get an "out-of-sample" prediction, also called "out-of-bag" (OOB) prediction. If rows are independent, model performance derived from these OOB predictions is usually good enough to be used for model validation. Do not use OOB results when rows are dependent such as for grouped samples.
  • Parameter tuning: Random forests offer many tuning parameters. Since the results typically do not depend too much on their choice, untuned random forests are often great benchmark models.

Example: random forest¶

Let us now fit a random forest for diamond prices with typical parameters and 500 trees. 80% of the data is used for training, the other 20% we use for evaluating the performance. We use stratified splitting on the response variable. (Throughout the rest of the lecture, we will ignore the problematic aspect of having repeated rows for some diamonds.)

In [ ]:
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder, KBinsDiscretizer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Stratified train/test split
binner = KBinsDiscretizer(n_bins=10, encode="ordinal")
y_binned = binner.fit_transform(diamonds[["price"]])

df_train, df_test, y_train, y_test = train_test_split(
    diamonds, diamonds["price"], test_size=0.2, random_state=341, stratify=y_binned
)

# Pipeline with model
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]

model = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("ordinal", OrdinalEncoder(categories=ord_levels), ord_features),
            ("numeric", "passthrough", ["carat"]),
        ],
        verbose_feature_names_out=False,
    ),
    RandomForestRegressor(
        n_estimators=500,
        max_features="sqrt",
        min_samples_leaf=5,
        oob_score=True,
        random_state=17,
        n_jobs=4,
    ),
)
model.fit(df_train, y_train)

# OOB performance
print(f"OOB R-squared: {model[-1].oob_score_:.2%}")
OOB R-squared: 98.07%
In [ ]:
# Test performance
rmse_test = mean_squared_error(y_test, model.predict(df_test), squared=False)
print(f"Test RMSE: {rmse_test:.3f}")
print(f"Test R-squared: {model.score(df_test, y_test):.2%}")
Test RMSE: 554.735
Test R-squared: 98.05%

Comments

  • Performance is excellent.
  • The OOB estimate of performance is extremely close to the test set performance.
  • Interpretation?

Interpreting a "black box"¶

In contrast to a single decision tree or a linear model, a combination of many trees is not easy to interpret. It is good practice for any ML model to study at least variable importance and the strongest effects, not just its performance. A pure prediction machine is hardly of any interest and might even contain mistakes like using covariates derived from the response. Model interpretation helps to fight such problems and thus also to increase trust in a model.

Variable importance¶

There are different approaches to measure the importance of a covariate. Since there is no general mathematical definition of "importance", the results of different approaches might be inconsistent across each other. For tree-based methods, a usual approach is to measure how many times a covariate $X$ was used in a split or how much total loss improvement came from splitting on $X$.

Approaches that work for any supervised model (including neural nets) include permutation importance and SHAP importance.

Effects¶

One of the main reasons for the success of modern methods like random forests is the fact that they automatically learn interactions between two or more covariates. Thus, the effect of a covariate $X$ typically depends on the values of other covariates. In the extreme case, the effect of $X$ is different for each observation. The best what we can do is to study the average effect of $X$ over many observations, i.e., averaging the effects over interactions. This leads us to partial dependence plots: They work for any supervised ML model and are constructed as follows: A couple of observations are selected. Then, their average prediction is visualized against $X$ when sliding their value of $X$ over a reasonable grid of values, keeping all other variables fixed. The more natural the Ceteris Paribus clause, the more reliable the partial dependence plots.

Remark: A partial dependence plot of a covariate in a linear regression is simply a visualization of its coefficient.

Alternatives to partial dependence plots include accumulated local effect plots and SHAP dependence plots. Both relax the Ceteris Paribus clause.

Example: random forest (continued)¶

For our last example, we will now look at variable importance and partial dependence plots.

In [ ]:
# Variable importance
import pandas as pd

feature_names = model[:-1].get_feature_names_out()
imps = pd.Series(model[-1].feature_importances_, index=feature_names)
_ = imps.sort_values().plot(kind="barh", title="Split gain importance")
In [ ]:
# Partial dependence plots
# Note: As soon as sklearn.inspect.PartialDependenceDisplay
# allows for passing a character grid, it is an option as well.
import dalex as dx
import plotly

plotly.offline.init_notebook_mode()  # for saving html with plotly plots

# Define explainer
exp = dx.Explainer(model, data=df_train, verbose=False)

# Plots
pdp_num = exp.model_profile(
    variables=["carat"], label="Partial depencence for 'carat'", verbose=False
)
pdp_num.plot()

pdp_ord = exp.model_profile(
    variable_type="categorical",
    variable_splits=dict(zip(ord_features, ord_levels)),
    label="Partial depencence for ordinal variables",
    verbose=False,
)
pdp_ord.plot(facet_scales="free")

Comments

  • As expected, carat is the most important predictor.
  • All effects as assessed by partial dependence make sense.

Exercises¶

  1. In above example, replace carat by its logarithm. Do the results change compared to the example without logs?
  2. Fit a random forest on the claims data for the binary variable clm using covariates veh_value, veh_body, veh_age, gender, area, and agecat. Choose a suitable tree depth either by cross-validation or by minimizing OOB error on the training data. Make sure to fit a probability random forest, i.e., predicting probabilities, not classes. Additionally, make sure to work with a relevant loss function (information/cross-entropy or Gini gain). Evaluate the final model on an independent test dataset. Interpret the results by split gain importance and partial dependence plots.

Gradient Boosted Trees¶

The idea of boosting was introduced by Schapire in 1990 [3] and roughly works as follows: A simple model is fit to the data. Then, another simple model is added, trying to correct the errors from the first model. This process is repeated until some stopping criterion triggers. As simple models or base learners, usually small decision trees are used, an idea pushed by Jerome Friedman in his famous 2001 article on the very general framework of gradient boosting machines [4].

Modern variants of such gradient boosted trees are XGBoost, LightGBM and CatBoost. These are the predominant algorithms in ML competitions on tabular data, check this comparison for differences with a screenshot as per Sept. 25, 2020:

Predictions are calculated similar to random forests, i.e., by combining predictions of all trees. As loss/objective function, one can choose among many possibilities. Often, using the same loss function as a corresponding GLM is a good choice.

Example: XGBoost (untuned)¶

As an initial example on gradient boosting and XGBoost, we fit a model for diamond prices with squared error as loss function. The number of rounds/trees is chosen by early stopping on the validation data, i.e., until validation (R)MSE stops improving for a couple or rounds. The learning rate (weight of each tree) is chosen by trial and error in order to end up with a reasonably small/large number of trees, see the next section for more details.

In [ ]:
# We can use either the native XGBoost interface or its Scikit-Learn
# interface. We will use the former.
import xgboost as xgb
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer

# Train/valid split
df_train, df_valid, y_train, y_valid = train_test_split(
    diamonds, diamonds["price"], test_size=0.2, random_state=937
)

# Preprocessing
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]

encoder = make_column_transformer(
    (OrdinalEncoder(categories=ord_levels), ord_features), ("passthrough", ["carat"])
)

X_train = encoder.fit_transform(df_train)
X_valid = encoder.transform(df_valid)
In [ ]:
# XGBoost data interface
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# Compact set of parameters
params = {"learning_rate": 0.02, "objective": "reg:squarederror", "eval_metric": "rmse"}
In [ ]:
# Fit model until validation RMSE stops improving over the last 20 rounds
fit = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=5000,  # a large value
    evals=[(dtrain, "train"), (dvalid, "valid")],  # the last is used for early stopping
    verbose_eval=40,
    early_stopping_rounds=20,
)
[0]	train-rmse:5481.27742	valid-rmse:5536.88586
[40]	train-rmse:2540.34776	valid-rmse:2578.91556
[80]	train-rmse:1272.18425	valid-rmse:1306.55024
[120]	train-rmse:766.21559	valid-rmse:800.65333
[160]	train-rmse:592.72252	valid-rmse:625.90753
[200]	train-rmse:538.89692	valid-rmse:571.34321
[240]	train-rmse:520.53616	valid-rmse:553.62927
[280]	train-rmse:512.15118	valid-rmse:546.68904
[320]	train-rmse:507.19968	valid-rmse:544.09935
[360]	train-rmse:502.98198	valid-rmse:543.11131
[400]	train-rmse:499.58072	valid-rmse:542.69507
[412]	train-rmse:498.50243	valid-rmse:542.76552

Comments:

  • More trees would mean better training performance, but worse validation performance.
  • Compared to simple validation, cross-validation would provide an even better way to find the optimal number of boosting rounds/trees.
  • In the next example, we will show how to interpret such model.

Parameters of gradient boosted trees¶

Gradient boosted trees offer a quite a lot of parameters. Unlike with random forests, they need to be tuned to achieve good results. It would be naive to use an algorithm like XGBoost without parameter tuning. Here is a selection:

  • Number of boosting rounds: In contrast to random forests, more trees/rounds is not always beneficial because the model begins to overfit after some time. The optimal number of rounds is usually found by early stopping, i.e., one lets the algorithm stop as soon as the (cross-)validation performance stops improving, see the example above.

  • Learning rate: The learning rate determines training speed and the impact of each tree to the final model. Typical values are between 0.01 and 0.5. In practical applications, it is set to a value that leads to a reasonable amount of trees (100-1000). Usually, halving the learning rate means twice as much boosting rounds for comparable performance.

  • Regularization parameters: Depending on the implementation, additional parameters are

    • the tree depth (often 3-7) or the number of leaves (often 7-63),
    • the strength of the L1 and/or L2 penalties added to the objective function (often between 0 and 5),
    • the row subsampling rate (often between 0.8 and 1),
    • the column subsampling rate (often between 0.6 and 1),
    • ...

    Reasonable regularization parameters are chosen by trial and error or systematically by randomized or grid search CV. Usually, it takes a couple of iterations until the range of the parameter values have been set appropriately.

Note on tuning: Since learning rate, number of boosting rounds and regularization parameters are heavily interdependent, a "big" randomized grid search CV to choose learning rate, boosting rounds and regularization is often not ideal. Above suggestion (fix learning rate, select number of rounds by early stopping and do grid search only on regularization parameters) is more focussed, see also the example below.

Example: XGBoost (tuned)¶

We will use XGBoost to fit diamond prices using the squared error as loss function and RMSE as evaluation metric, now using a sophisticated tuning strategy.

Overall, the modelling strategy is as follows:

  1. Using default regularization parameters, set the learning rate to get reasonable number of trees with CV-based early stopping.
  2. Use randomized search CV with early stopping to tune regularization parameters such as tree depth. Iterate if needed.
  3. Use the best parameter combination (incl. number of trees) to fit the model on the training data. "Best" typically means in terms of CV performance. As mentioned in the last chapter and depending on the situation, it can also mean "good CV performance and not too heavy overfit compared to insample performance" or some other reasonable criterion.
In [ ]:
# This code is pretty long, but may serve as a general template
import json
from pathlib import Path
import pandas as pd
import xgboost as xgb
from plotnine.data import diamonds
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer

# General settings
grid_file = Path("gridsearch") / "diamonds_xgb.txt"
objective = "reg:squarederror"
metric = "rmse"

# Train/test split
df_train, df_test, y_train, y_test = train_test_split(
    diamonds, diamonds["price"], test_size=0.2, random_state=341
)

# Preprocessing
ord_features = ["cut", "color", "clarity"]
ord_levels = [diamonds[x].cat.categories.to_list() for x in ord_features]

encoder = make_column_transformer(
    (OrdinalEncoder(categories=ord_levels), ord_features),
    ("passthrough", ["carat"]),
    verbose_feature_names_out=False,
)

X_train = encoder.fit_transform(df_train)
X_test = encoder.transform(df_test)

# XGBoost data interface (feature_names are added for importance plot)
feature_names = encoder.get_feature_names_out()
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)
In [ ]:
# Step 1: Find learning rate with reasonable tree count (0.02 seems ok)
if True:
    params = {"learning_rate": 0.02, "objective": objective, "eval_metric": metric}

    # Cross-validation
    cvm = xgb.cv(
        params=params,
        dtrain=dtrain,
        num_boost_round=5000,
        nfold=5,
        early_stopping_rounds=20,
        callbacks=[xgb.callback.EvaluationMonitor(period=50)],
    )

    # A LR of 0.02 provides about 370 trees, which is a convenient amount
    print("Best boosting round with default params:\n", cvm.tail(1))
[0]	train-rmse:5495.84421	test-rmse:5495.78773
[50]	train-rmse:2121.53284	test-rmse:2126.02558
[100]	train-rmse:960.61972	test-rmse:972.46793
[150]	train-rmse:619.79613	test-rmse:641.67979
[200]	train-rmse:538.28014	test-rmse:566.70507
[250]	train-rmse:516.24441	test-rmse:549.09343
[300]	train-rmse:506.97159	test-rmse:544.56969
[350]	train-rmse:500.13366	test-rmse:543.60838
[380]	train-rmse:496.58954	test-rmse:543.59365
Best boosting round with default params:
      train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
360       498.968055        1.814964      543.537763        8.83253
In [ ]:
# Step 2: Iterate randomized SearchCV for regularization parameters
if False:
    from tqdm import tqdm
    from sklearn.model_selection import ParameterSampler  # , ParameterGrid

    # Use ParameterGrid(...) if grid is small enough to check all combinations

    # Final grid search after some iterations
    grid = ParameterSampler(
        {
            "learning_rate": [0.02],
            "objective": [objective],
            "eval_metric": [metric],
            "max_depth": [4, 5, 6],
            "min_child_weight": [1, 10],
            "colsample_bytree": [0.8, 1],
            "subsample": [0.8, 1],
            "reg_lambda": [0, 1, 2],
            "reg_alpha": [0, 1, 2],
            # "tree_method": ["hist"],   # when data is large
            "min_split_loss": [0, 0.0001],
        },
        n_iter=20,
    )

    # Iterate over grid and save relevant information on disk
    search = []
    for g in tqdm(grid):
        cvm = xgb.cv(
            params=g,
            dtrain=dtrain,
            num_boost_round=5000,
            nfold=5,
            early_stopping_rounds=20,
        )
        # Keep number of rounds, cv score, train score, and parameters
        search.append((len(cvm), *cvm.iloc[-1, [2, 0]], g))

        with open(grid_file, "w") as f:
            json.dump(search, f)

# Load grid and check (A) sort order and (B) if grid ranges were set reasonable
with open(grid_file) as f:
    search = json.load(f)

search_df = pd.DataFrame.from_records(
    search, columns=["num_boost_round", "cv_score", "train_score", "params"]
).sort_values("cv_score")

with pd.option_context("display.max_colwidth", None):
    display(search_df.head())
num_boost_round cv_score train_score params
1 430 542.998486 504.787591 {'subsample': 0.8, 'reg_lambda': 2, 'reg_alpha': 2, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1}
18 382 543.095764 505.431964 {'subsample': 0.8, 'reg_lambda': 0, 'reg_alpha': 2, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 10, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1}
14 373 543.511145 497.539026 {'subsample': 1, 'reg_lambda': 1, 'reg_alpha': 0, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1}
10 390 544.175842 500.208081 {'subsample': 1, 'reg_lambda': 2, 'reg_alpha': 1, 'objective': 'reg:squarederror', 'min_split_loss': 0, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1}
0 604 544.376392 504.577338 {'subsample': 1, 'reg_lambda': 1, 'reg_alpha': 1, 'objective': 'reg:squarederror', 'min_split_loss': 0.0001, 'min_child_weight': 1, 'max_depth': 5, 'learning_rate': 0.02, 'eval_metric': 'rmse', 'colsample_bytree': 1}
In [ ]:
# Step 3: Fit on best params
best = search_df.iloc[0]

fit = xgb.train(params=best.params, dtrain=dtrain, num_boost_round=best.num_boost_round)

Now, the model is ready to be inspected by evaluating

  • its test performance,
  • by looking at gain importance and
  • a couple of partial dependence plots.
In [ ]:
# Interpret model
import numpy as np
import dalex as dx
import plotly
from sklearn.metrics import r2_score, mean_squared_error as mse
from xgboost import plot_importance

plotly.offline.init_notebook_mode()  # for saving html with plotly plots

# Test performance
pred_test = fit.predict(dtest)
print(f"Test RMSE: {mse(y_test, pred_test, squared=False):.3f}")
print(f"Test R-squared: {r2_score(y_test, pred_test):.2%}")
Test RMSE: 535.562
Test R-squared: 98.18%
In [ ]:
# Variable importance regarding MSE improvement
plot_importance(fit, importance_type="gain", show_values=False, xlabel="Gain")
Out[ ]:
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='Gain', ylabel='Features'>
In [ ]:
# Partial dependence plots on training data
def pred_fun(model, df):
    X = encoder.transform(df)
    dX = xgb.DMatrix(X, feature_names=feature_names)
    return model.predict(dX)


exp = dx.Explainer(fit, df_train, predict_function=pred_fun, verbose=False)

pdp_num = exp.model_profile(
    variable_splits={"carat": np.linspace(0, 3, 41)},
    label="Partial depencence for carat",
    verbose=False,
)
pdp_num.plot()

pdp_ord = exp.model_profile(
    variable_type="categorical",
    variable_splits=dict(zip(ord_features, ord_levels)),
    label="Partial depencence for ordinal variables",
    verbose=False,
)
pdp_ord.plot(facet_scales="free")

Comment: The resulting model seems comparable to the random forest with slightly better performance.

The next code snipped shows how to use XGBoost's scikit-learn interface.

In [ ]:
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline

xgb_model = make_pipeline(
    encoder,
    XGBRegressor(
        **best.params, n_estimators=best.num_boost_round, importance_type="gain"
    ),
)
xgb_model.fit(df_train, y_train)
print(f"Test R-squared: {xgb_model.score(df_test, y_test):.2%}")
Test R-squared: 98.18%

Exercises¶

  1. Study the online documentation of XGBoost to figure out how to make the model monotonically increasing in carat. Test your insights without rerunning the grid search in our last example, i.e., just by refitting the final model. How does the partial dependence plot for carat look now?
  2. Develop a strong XGBoost model for the claims data set with binary response clm and covariates veh_value, veh_body, veh_age, gender, area, and agecat. Use a clean cross-validation/test approach. Use log loss as loss function and evaluation metric. Interpret its results.
  3. Optional. Study the documentation of LightGBM. Use LightGBM to develop a competitor to the XGBoost claims model from Exercise 2. The XGBoost code needs to be slightly adapted. Compare grid search time.

Chapter Summary¶

In this chapter, we have met decision trees, random forests and tree boosting. Single decision trees are very easy to interpret but do not perform too well. On the other hand, tree ensembles like the random forest or gradient boosted trees usually perform very well but are tricky to interpret. We have introduced interpretation tools to look into such "black boxes". The main reason why random forests and boosted trees often provide more accurate models than a linear model lies in their ability to automatically learn interactions and other non-linear effects.

Chapter References¶

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984.

[2] L. Breiman, "Random forests". In: Machine Learning, 2001, 45(1).

[3] R. Schapire, "The strength of weak learnability", Machine Learning, Vol 5, Nr. 2, 1990.

[4] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine", 2001.